{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deep Kernel Learning GP Regression (w/ KISS-GP)\n",
    "\n",
    "## Overview\n",
    "\n",
    "In this notebook, we'll give a brief tutorial on how to use deep kernel learning for regression on a medium scale dataset using SKI. This also demonstrates how to incorporate standard PyTorch modules in to a Gaussian process model. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import math\n",
    "import torch\n",
    "import gpytorch\n",
    "from matplotlib import pyplot as plt\n",
    "\n",
    "# Make plots inline\n",
    "%matplotlib inline"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Loading Data\n",
    "\n",
    "For this example notebook, we'll be using the `elevators` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset that has already been scaled and normalized appropriately. For this notebook, we'll simply be splitting the data using the first 80% of the data as training and the last 20% as testing.\n",
    "\n",
    "**Note**: Running the next cell will attempt to download a ~400 KB dataset file to the current directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import urllib.request\n",
    "import os.path\n",
    "from scipy.io import loadmat\n",
    "from math import floor\n",
    "\n",
    "if not os.path.isfile('elevators.mat'):\n",
    "    print('Downloading \\'elevators\\' UCI dataset...')\n",
    "    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1jhWL3YUHvXIaftia4qeAyDwVxo6j1alk', 'elevators.mat')\n",
    "    \n",
    "data = torch.Tensor(loadmat('elevators.mat')['data'])\n",
    "X = data[:, :-1]\n",
    "X = X - X.min(0)[0]\n",
    "X = 2 * (X / X.max(0)[0]) - 1\n",
    "y = data[:, -1]\n",
    "\n",
    "# Use the first 80% of the data for training, and the last 20% for testing.\n",
    "train_n = int(floor(0.8*len(X)))\n",
    "\n",
    "train_x = X[:train_n, :].contiguous().cuda()\n",
    "train_y = y[:train_n].contiguous().cuda()\n",
    "\n",
    "test_x = X[train_n:, :].contiguous().cuda()\n",
    "test_y = y[train_n:].contiguous().cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the DKL Feature Extractor\n",
    "\n",
    "Next, we define the neural network feature extractor used to define the deep kernel. In this case, we use a fully connected network with the architecture `d -> 1000 -> 500 -> 50 -> 2`, as described in the original DKL paper. All of the code below uses standard PyTorch implementations of neural network layers."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_dim = train_x.size(-1)\n",
    "\n",
    "class LargeFeatureExtractor(torch.nn.Sequential):           \n",
    "    def __init__(self):                                      \n",
    "        super(LargeFeatureExtractor, self).__init__()        \n",
    "        self.add_module('linear1', torch.nn.Linear(data_dim, 1000))\n",
    "        self.add_module('relu1', torch.nn.ReLU())                  \n",
    "        self.add_module('linear2', torch.nn.Linear(1000, 500))     \n",
    "        self.add_module('relu2', torch.nn.ReLU())                  \n",
    "        self.add_module('linear3', torch.nn.Linear(500, 50))       \n",
    "        self.add_module('relu3', torch.nn.ReLU())                  \n",
    "        self.add_module('linear4', torch.nn.Linear(50, 2))         \n",
    "                                                             \n",
    "feature_extractor = LargeFeatureExtractor().cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Defining the GP Model\n",
    "\n",
    "We now define the GP model. For more details on the use of GP models, see our simpler examples. This model uses a `GridInterpolationKernel` (SKI) with an RBF base kernel. \n",
    "\n",
    "### The forward method\n",
    "\n",
    "In deep kernel learning, the forward method is where most of the interesting new stuff happens. Before calling the mean and covariance modules on the data as in the simple GP regression setting, we first pass the input data `x` through the neural network feature extractor. Then, to ensure that the output features of the neural network remain in the grid bounds expected by SKI, we scales the resulting features to be between 0 and 1.\n",
    "\n",
    "Only after this processing do we call the mean and covariance module of the Gaussian process. This example also demonstrates the flexibility of defining GP models that allow for learned transformations of the data (in this case, via a neural network) before calling the mean and covariance function. Because the neural network in this case maps to two final output features, we will have no problem using SKI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "class GPRegressionModel(gpytorch.models.ExactGP):\n",
    "        def __init__(self, train_x, train_y, likelihood):\n",
    "            super(GPRegressionModel, self).__init__(train_x, train_y, likelihood)\n",
    "            self.mean_module = gpytorch.means.ConstantMean()\n",
    "            self.covar_module = gpytorch.kernels.GridInterpolationKernel(\n",
    "                gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel(ard_num_dims=2)),\n",
    "                num_dims=2, grid_size=100\n",
    "            )\n",
    "            self.feature_extractor = feature_extractor\n",
    "\n",
    "        def forward(self, x):\n",
    "            # We're first putting our data through a deep net (feature extractor)\n",
    "            # We're also scaling the features so that they're nice values\n",
    "            projected_x = self.feature_extractor(x)\n",
    "            projected_x = projected_x - projected_x.min(0)[0]\n",
    "            projected_x = 2 * (projected_x / projected_x.max(0)[0]) - 1\n",
    "        \n",
    "            mean_x = self.mean_module(projected_x)\n",
    "            covar_x = self.covar_module(projected_x)\n",
    "            return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "likelihood = gpytorch.likelihoods.GaussianLikelihood().cuda()\n",
    "model = GPRegressionModel(train_x, train_y, likelihood).cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training the model\n",
    "\n",
    "The cell below trains the DKL model above, learning both the hyperparameters of the Gaussian process **and** the parameters of the neural network in an end-to-end fashion using Type-II MLE. We run 20 iterations of training using the `Adam` optimizer built in to PyTorch. With a decent GPU, this should only take a few seconds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jrg365/gpytorch/gpytorch/utils/cholesky.py:14: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  potrf_list = [sub_mat.potrf() for sub_mat in mat.view(-1, *mat.shape[-2:])]\n",
      "/home/jrg365/gpytorch/gpytorch/lazy/added_diag_lazy_tensor.py:66: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  ld_one = lr_flipped.potrf().diag().log().sum() * 2\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Iter 1/60 - Loss: 0.926\n",
      "Iter 2/60 - Loss: 0.922\n",
      "Iter 3/60 - Loss: 0.918\n",
      "Iter 4/60 - Loss: 0.913\n",
      "Iter 5/60 - Loss: 0.908\n",
      "Iter 6/60 - Loss: 0.906\n",
      "Iter 7/60 - Loss: 0.902\n",
      "Iter 8/60 - Loss: 0.895\n",
      "Iter 9/60 - Loss: 0.890\n",
      "Iter 10/60 - Loss: 0.885\n",
      "Iter 11/60 - Loss: 0.880\n",
      "Iter 12/60 - Loss: 0.875\n",
      "Iter 13/60 - Loss: 0.871\n",
      "Iter 14/60 - Loss: 0.866\n",
      "Iter 15/60 - Loss: 0.861\n",
      "Iter 16/60 - Loss: 0.856\n",
      "Iter 17/60 - Loss: 0.851\n",
      "Iter 18/60 - Loss: 0.846\n",
      "Iter 19/60 - Loss: 0.841\n",
      "Iter 20/60 - Loss: 0.836\n",
      "Iter 21/60 - Loss: 0.832\n",
      "Iter 22/60 - Loss: 0.827\n",
      "Iter 23/60 - Loss: 0.821\n",
      "Iter 24/60 - Loss: 0.816\n",
      "Iter 25/60 - Loss: 0.811\n",
      "Iter 26/60 - Loss: 0.806\n",
      "Iter 27/60 - Loss: 0.800\n",
      "Iter 28/60 - Loss: 0.795\n",
      "Iter 29/60 - Loss: 0.790\n",
      "Iter 30/60 - Loss: 0.785\n",
      "Iter 31/60 - Loss: 0.780\n",
      "Iter 32/60 - Loss: 0.774\n",
      "Iter 33/60 - Loss: 0.769\n",
      "Iter 34/60 - Loss: 0.764\n",
      "Iter 35/60 - Loss: 0.759\n",
      "Iter 36/60 - Loss: 0.754\n",
      "Iter 37/60 - Loss: 0.750\n",
      "Iter 38/60 - Loss: 0.745\n",
      "Iter 39/60 - Loss: 0.739\n",
      "Iter 40/60 - Loss: 0.734\n",
      "Iter 41/60 - Loss: 0.729\n",
      "Iter 42/60 - Loss: 0.724\n",
      "Iter 43/60 - Loss: 0.720\n",
      "Iter 44/60 - Loss: 0.715\n",
      "Iter 45/60 - Loss: 0.710\n",
      "Iter 46/60 - Loss: 0.705\n",
      "Iter 47/60 - Loss: 0.700\n",
      "Iter 48/60 - Loss: 0.695\n",
      "Iter 49/60 - Loss: 0.690\n",
      "Iter 50/60 - Loss: 0.685\n",
      "Iter 51/60 - Loss: 0.680\n",
      "Iter 52/60 - Loss: 0.675\n",
      "Iter 53/60 - Loss: 0.670\n",
      "Iter 54/60 - Loss: 0.666\n",
      "Iter 55/60 - Loss: 0.661\n",
      "Iter 56/60 - Loss: 0.656\n",
      "Iter 57/60 - Loss: 0.651\n",
      "Iter 58/60 - Loss: 0.645\n",
      "Iter 59/60 - Loss: 0.640\n",
      "Iter 60/60 - Loss: 0.635\n",
      "CPU times: user 11.3 s, sys: 3.11 s, total: 14.5 s\n",
      "Wall time: 14.4 s\n"
     ]
    }
   ],
   "source": [
    "# Find optimal model hyperparameters\n",
    "model.train()\n",
    "likelihood.train()\n",
    "\n",
    "# Use the adam optimizer\n",
    "optimizer = torch.optim.Adam([\n",
    "    {'params': model.feature_extractor.parameters()},\n",
    "    {'params': model.covar_module.parameters()},\n",
    "    {'params': model.mean_module.parameters()},\n",
    "    {'params': model.likelihood.parameters()},\n",
    "], lr=0.01)\n",
    "\n",
    "# \"Loss\" for GPs - the marginal log likelihood\n",
    "mll = gpytorch.mlls.ExactMarginalLogLikelihood(likelihood, model)\n",
    "\n",
    "training_iterations = 60\n",
    "def train():\n",
    "    for i in range(training_iterations):\n",
    "        # Zero backprop gradients\n",
    "        optimizer.zero_grad()\n",
    "        # Get output from model\n",
    "        output = model(train_x)\n",
    "        # Calc loss and backprop derivatives\n",
    "        loss = -mll(output, train_y)\n",
    "        loss.backward()\n",
    "        print('Iter %d/%d - Loss: %.3f' % (i + 1, training_iterations, loss.item()))\n",
    "        optimizer.step()\n",
    "        \n",
    "# See dkl_mnist.ipynb for explanation of this flag\n",
    "with gpytorch.settings.use_toeplitz(True):\n",
    "    %time train()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Making Predictions\n",
    "\n",
    "The next cell gets the predictive covariance for the test set (and also technically gets the predictive mean, stored in `preds.mean()`) using the standard SKI testing code, with no acceleration or precomputation. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/home/jrg365/gpytorch/gpytorch/utils/cholesky.py:14: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  potrf_list = [sub_mat.potrf() for sub_mat in mat.view(-1, *mat.shape[-2:])]\n",
      "/home/jrg365/gpytorch/gpytorch/lazy/added_diag_lazy_tensor.py:66: UserWarning: torch.potrf is deprecated in favour of torch.cholesky and will be removed in the next release. Please use torch.cholesky instead and note that the :attr:`upper` argument in torch.cholesky defaults to ``False``.\n",
      "  ld_one = lr_flipped.potrf().diag().log().sum() * 2\n"
     ]
    }
   ],
   "source": [
    "model.eval()\n",
    "likelihood.eval()\n",
    "with torch.no_grad(), gpytorch.settings.use_toeplitz(False), gpytorch.settings.fast_pred_var():\n",
    "    preds = model(test_x)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Test MAE: 0.07841506600379944\n"
     ]
    }
   ],
   "source": [
    "print('Test MAE: {}'.format(torch.mean(torch.abs(preds.mean - test_y))))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
